Social network analysis of web links to eliminate false positives in collaborative anti-spam systems

Authors

  • Zac Sadan
  • David G. Schwartz
Abstract

The performance of today’s email anti-spam systems is primarily measured by the percentage of false positives (non-spam messages detected as spam) rather than by the percentage of false negatives (real spam messages left unblocked). One reliable anti-spam technique is the Uniform Resource Locator (URL)-based filter, which is used by most collaborative signature-based filters. URL-based filters examine URL frequency in incoming email and block bulk email when a predetermined threshold is passed. However, this can cause legitimate mass-distributed email to be blocked in error. URL-based methods are therefore limited in their ability to prevent false positives, and finding solutions to this problem is critical for anti-spam systems. We present a complementary technique for URL-based filters, which uses the betweenness of web-page hostnames to prevent the erroneous blocking of legitimate hosts. The technique was tested on a corpus of 10,000 random domains selected from the URIBL white and black list databases. We generated the appropriate linked network for each domain and calculated its betweenness centrality. We found that the betweenness centrality of whitelist domains is significantly higher than that of blacklist domains. Results clearly show that the betweenness centrality metric can be a powerful and effective complementary tool for URL-based anti-spam systems. It can achieve a high level of accuracy in determining legitimate hostnames and thus significantly reduce false positives in these systems. © 2011 Elsevier Ltd. All rights reserved.
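To make the central metric concrete, the following is a minimal sketch of betweenness centrality on a small directed link graph, computed with Brandes' algorithm for unweighted graphs. It is not the authors' implementation: the abstract does not say how the link network was built or which tool computed the metric. The hostnames, the `links` dictionary, and the function name are hypothetical, and scores are left unnormalized.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness for a directed, unweighted graph.

    graph: dict mapping a hostname to a list of hostnames it links to.
    Assumes each link appears at most once per source host.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)
        dist = dict.fromkeys(nodes, -1)
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Back-propagate dependencies in order of decreasing distance.
        delta = dict.fromkeys(nodes, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical toy link graph: three hosts link to a hub, which links onward.
links = {
    "blog.example": ["news.example"],
    "shop.example": ["news.example"],
    "bulk.example": ["news.example"],
    "news.example": ["wiki.example", "forum.example"],
}

scores = betweenness_centrality(links)
for host in sorted(scores, key=scores.get, reverse=True):
    print(f"{host}: {scores[host]}")
```

In this toy graph only `news.example` lies on shortest paths between other hosts, so it is the only host with a nonzero score. A filter following the abstract's finding would treat a high score as evidence that a hostname is legitimate; any cutoff value would have to be chosen empirically and is not given in the abstract.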

Similar resources

Introducing Social Trust to Collaborative Spam Mitigation

We propose SocialFilter, a trust-aware collaborative spam mitigation system. SocialFilter enables nodes with no email classification functionality to query the network on whether a host is a spammer. It employs Sybil-resilient trust inference to weigh the reports concerning spamming hosts that collaborating spam-detecting nodes (reporters) submit to the system. It weighs the spam reports accord...

Link-Based Characterization and Detection of Web Spam

We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. ...

SIGIR 2006 Workshop on Adversarial Information Retrieval on the Web AIRWeb 2006

We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. ...

Survey on Internet Spam: Classification and Analysis

In recent years, detecting and dealing with spam in information retrieval systems has been a difficult task. This article surveys the classification of various kinds of spam on the internet based on their properties. The impact of spam in social networks, email, images, content and links is discussed, and the techniques applied to prevent spam in these areas are listed. A detailed analysis of c...

Incremental Immune-Inspired Clustering Approach to Behavior-Based Anti-Spam Technology

Facing a new type of challenge, namely maintaining clusters in a dynamic web environment with a high volume of updates and costly re-clustering, the paper describes a novel behavior-based anti-spam technology based on an incremental immune-inspired clustering algorithm. We use an “internal image” network to represent the input data set in order to reduce data redundancy, whilst at the same time extractin...

Journal title:
  • J. Network and Computer Applications

Volume 34

Publication date: 2011